Impact of inconsistent ethnicity recordings on estimates of inequality in child health and education data: a data linkage study of Child and Adolescent Mental Health Services in South London

Objectives Ethnicity data are critical for identifying inequalities, but previous studies suggest that ethnicity is not consistently recorded between different administrative datasets. With researchers increasingly leveraging cross-domain data linkages, we investigated the completeness and consistency of ethnicity data in two linked health and education datasets.

Design Cohort study.

Setting South London and Maudsley NHS Foundation Trust deidentified electronic health records, accessed via Clinical Record Interactive Search (CRIS) and the National Pupil Database (NPD) (2007–2013).

Participants N=30 426 children and adolescents referred to local Child and Adolescent Mental Health Services.

Primary and secondary outcome measures Ethnicity data were compared between CRIS and the NPD. Associations between ethnicity as recorded from each source and key educational and clinical outcomes were explored with risk ratios.

Results Ethnicity data were available for 79.3% from the NPD, 87.0% from CRIS, 97.3% from either source and 69.0% from both sources. Among those who had ethnicity data from both, the two data sources agreed on 87.0% of aggregate ethnicity categorisations overall, but with high levels of disagreement in Mixed and Other ethnic groups. Strengths of associations between ethnicity, educational attainment and neurodevelopmental disorder varied according to which data source was used to code ethnicity. For example, as compared with White pupils, a significantly higher proportion of Asian pupils achieved expected educational attainment thresholds only if ethnicity was coded from the NPD (RR=1.46, 95% CI 1.29 to 1.64), not if ethnicity was coded from CRIS (RR=1.11, 0.98 to 1.26).

Conclusions Data linkage has the potential to minimise missing ethnicity data, and overlap in ethnicity categorisations between CRIS and the NPD was generally high. However, choosing which data source to primarily code ethnicity from can have implications for analyses of ethnicity, mental health and educational outcomes. Users of linked data should exercise caution in combining and comparing ethnicity between different data sources.
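The completeness and agreement figures reported above can be illustrated with a minimal sketch. It is not the authors' code: the `records` list, the function names and the use of `None` for a missing value are assumptions made for illustration. The only check against the reported results is the inclusion-exclusion identity (NPD + CRIS - both = either), which the published percentages satisfy.

```python
# Hypothetical linked records: (NPD ethnicity, CRIS ethnicity); None means missing.
records = [
    ("White", "White"),
    ("Asian", "Asian"),
    ("Mixed", "Black"),
    ("Black", None),
    (None, "White"),
    (None, None),
]


def pct(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total


def completeness(records):
    """Percentage of individuals with ethnicity recorded in each source."""
    n = len(records)
    npd = sum(1 for a, _ in records if a is not None)
    cris = sum(1 for _, b in records if b is not None)
    both = sum(1 for a, b in records if a is not None and b is not None)
    either = npd + cris - both  # inclusion-exclusion
    return {
        "npd": pct(npd, n),
        "cris": pct(cris, n),
        "both": pct(both, n),
        "either": pct(either, n),
    }


def agreement(records):
    """Percentage of identical categorisations among those recorded in both sources."""
    paired = [(a, b) for a, b in records if a is not None and b is not None]
    return pct(sum(1 for a, b in paired if a == b), len(paired))


# Consistency check on the percentages reported in the abstract.
reported_either = round(79.3 + 87.0 - 69.0, 1)  # equals the reported 97.3
```

Agreement is computed only among individuals with ethnicity recorded in both sources, which mirrors the denominator described in the Results.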


Introduction
study - Give the eligibility criteria, and the sources and methods of selection of participants. Describe methods of follow-up (b) Cohort study - For matched studies, give matching criteria and number of exposed and unexposed 5 RECORD 6.1: The methods of study population selection (such as codes or algorithms used to identify subjects) should be listed in detail. If this is not possible, an explanation should be provided. RECORD 6.2: Any validation studies of the codes or algorithms used to select the population should be referenced. If validation 5
Supplemental material. BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance placed on this supplemental material which has been supplied by the author(s). BMJ Open doi: 10.

Table S2: Ethnic groups in the NPD. Column headings: Approved extended categories; Minor ethnic group; Major ethnic group.
Benchimol, E. I., Smeeth, L., Guttmann, A., Harron, K., Moher, D., Petersen, I., ... & RECORD Working Committee. (2015). The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Med, 12(10), e1001885.

Results

Participants 13 (a) Report the numbers of individuals at each stage of the study (e.g., numbers potentially eligible, examined for eligibility, confirmed eligible, included in the study, completing follow-up, and analysed) (b) Give reasons for non-participation at each stage. (c) Consider use of a flow diagram. RECORD 13.1: Describe in detail the selection of the persons included in the study (i.e., study population selection) including filtering based on data quality, data availability and linkage. The selection of included persons can be described in the text and/or by means of the study flow diagram.

Descriptive data 14 (a) Give characteristics of study participants (e.g., demographic, clinical, social) and information on exposures and potential confounders (b) Indicate the number of participants with missing data for each variable of interest (c) Cohort study - summarise follow-up time (e.g., average and total amount)

*Checklist is protected under Creative Commons Attribution (CC BY) license.

Table S5: Proportion of disaggregated CRIS ethnicities categorised as consistent major ethnic groups in the NPD, among individuals where ethnicity was available from both sources (n=20,916).
Row percentages provided. Off-diagonal proportions have been suppressed to avoid disclosive cell counts. For the purposes of this table, we have excluded pupils who were categorised as Chinese in either data source, in order to avoid disclosive cell counts.
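As a rough illustration of how the diagonal row percentages in a table of this kind could be derived, the sketch below counts, for each disaggregated CRIS ethnicity, the share of individuals whose NPD major ethnic group matches the major group that the CRIS category maps to. Only these consistent (diagonal) proportions are returned, in line with the suppression of off-diagonal cells. The function name, the mapping and the example categories are hypothetical and are not taken from the study's tables.

```python
from collections import Counter


def consistent_row_percentages(records, major_group_of):
    """Row percentage of each detailed CRIS ethnicity coded to the matching NPD major group.

    records: iterable of (cris_detailed, npd_major) pairs for individuals
             with ethnicity available from both sources.
    major_group_of: dict mapping each detailed CRIS ethnicity to a major group
                    (a hypothetical lookup, supplied by the caller).
    """
    totals = Counter()
    consistent = Counter()
    for cris_detailed, npd_major in records:
        totals[cris_detailed] += 1
        if major_group_of[cris_detailed] == npd_major:
            consistent[cris_detailed] += 1
    # Off-diagonal cells are deliberately not reported.
    return {row: round(100.0 * consistent[row] / totals[row], 1) for row in totals}


# Illustrative data only; these category labels are assumptions.
example_mapping = {"White British": "White", "Indian": "Asian"}
example_records = [
    ("White British", "White"),
    ("White British", "White"),
    ("White British", "White"),
    ("White British", "Mixed"),
    ("Indian", "Asian"),
    ("Indian", "Other"),
]
example_table = consistent_row_percentages(example_records, example_mapping)
```

Reporting only the diagonal also avoids a situation where a single suppressed cell could be recovered by subtracting the remaining row percentages from 100.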

Table S6: Unadjusted risk ratios between ethnicity derived from either NPD or CRIS (exposure) and whether the Year 11 expected attainment threshold was achieved (outcome; 'no' is the reference group). Column groups: NPD-derived ethnicity as exposure, supplemented from CRIS if missing; CRIS-derived ethnicity as exposure, supplemented from NPD if missing.
The n=15,859 not included in these analyses were missing data on Year 11 attainment. The n=817 in the missing ethnicity group are those with ethnicity data available neither in CRIS, nor the NPD. Abbreviations: CI=Confidence Interval, CRIS=Clinical Record Interactive Search, NPD=National Pupil Database, RR=Risk Ratio.
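The two exposure definitions (one source as primary, supplemented from the other if missing) can be sketched as a simple fallback rule. This is an illustrative reading of the column descriptions, not the authors' implementation; the function name, the `primary` parameter and the "Missing" label are assumptions.

```python
def code_ethnicity(npd, cris, primary="NPD"):
    """Code ethnicity from the primary source, falling back to the other if missing.

    npd, cris: ethnicity category from each source, or None if not recorded.
    primary:   "NPD" or "CRIS", the source used first.
    Returns "Missing" when neither source holds a value.
    """
    if primary not in ("NPD", "CRIS"):
        raise ValueError("primary must be 'NPD' or 'CRIS'")
    first, second = (npd, cris) if primary == "NPD" else (cris, npd)
    if first is not None:
        return first
    if second is not None:
        return second
    return "Missing"
```

For an individual recorded as Mixed in the NPD and Black in CRIS, `code_ethnicity("Mixed", "Black", primary="NPD")` gives "Mixed" while `primary="CRIS"` gives "Black". Only discordant records like this one change group between the two exposure definitions, since records present in a single source are coded identically either way.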

Table S7: Unadjusted risk ratios between ethnicity derived from either NPD or CRIS (exposure) and neurodevelopmental disorder diagnosis (outcome; 'no' is the reference group). Column groups: NPD-derived ethnicity as exposure, supplemented from CRIS if missing; CRIS-derived ethnicity as exposure, supplemented from NPD if missing.
The n=817 in the missing ethnicity group are those with ethnicity data available neither in CRIS, nor the NPD. Abbreviations: CI=Confidence Interval, CRIS=Clinical Record Interactive Search, NPD=National Pupil Database, RR=Risk Ratio.
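For readers unfamiliar with the measure, an unadjusted risk ratio with a 95% confidence interval can be computed from a 2x2 table as sketched below. The interval uses the standard log-scale (Wald) approximation; the source does not state which interval method was used, so this choice, the function name and the example counts are assumptions.

```python
import math


def risk_ratio(events_exposed, n_exposed, events_reference, n_reference, z=1.96):
    """Unadjusted risk ratio with a log-scale Wald confidence interval.

    events_exposed / n_exposed:     outcome count and size of the comparison group.
    events_reference / n_reference: outcome count and size of the reference group.
    Returns (rr, lower, upper). Requires non-zero event counts in both groups.
    """
    risk_exposed = events_exposed / n_exposed
    risk_reference = events_reference / n_reference
    rr = risk_exposed / risk_reference
    # Standard error of log(RR)
    se = math.sqrt(
        1 / events_exposed - 1 / n_exposed + 1 / events_reference - 1 / n_reference
    )
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper


# Hypothetical counts: 30 of 100 in the comparison group, 20 of 100 in the reference group.
rr, lower, upper = risk_ratio(30, 100, 20, 100)
```

An interval that includes 1 indicates that the difference from the reference group is not statistically significant at the 5% level, which is the distinction drawn between the NPD-coded and CRIS-coded results in the abstract.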